5 research outputs found
Impact Factor: outdated artefact or stepping-stone to journal certification?
A review of Garfield's journal impact factor and its specific implementation
as the Thomson Reuters Impact Factor reveals several weaknesses in this
commonly-used indicator of journal standing. Key limitations include the
mismatch between citing and cited documents, the deceptive display of three
decimals that belies the real precision, and the absence of confidence
intervals. These are minor issues that are easily amended and should be
corrected, but more substantive improvements are needed. There are indications
that the scientific community seeks and needs better certification of journal
procedures to improve the quality of published science. Comprehensive
certification of editorial and review procedures could help ensure adequate
procedures to detect duplicate and fraudulent submissions. Comment: 25 pages, 12 figures, 6 tables
Open-Set Classification for Automated Genre Identification
Automated Genre Identification (AGI) of web pages is a problem of increasing importance since web genre (e.g. blog, news, e-shops, etc.) information can enhance modern Information Retrieval (IR) systems. The state of the art in this field treats AGI as a closed-set classification problem, for which a variety of web page representations and machine learning models have been intensively studied. In this paper, we study AGI as an open-set classification problem, which better reflects the real-world conditions of applying AGI in practice. Focusing on the use of content information, different text representation methods (words and character n-grams) are tested. Moreover, two classification methods are examined: one-class SVM learners, used as a baseline, and an ensemble of classifiers based on random feature subspacing, originally proposed for author identification. It is demonstrated that very high precision can be achieved in open-set AGI while recall remains relatively high.
Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance
Automatic plagiarism detection with a reference corpus compares a suspicious text to a set of original documents in order to relate the plagiarised fragments to their potential source. Publications on this task often assume that the search space (the set of reference documents) is a narrow set where any search strategy will produce a good output in a short time. However, this is not always true. Reference corpora are often composed of a large set of original documents, for which a simple exhaustive search strategy becomes practically impossible. Before carrying out an exhaustive search, it is necessary to reduce the search space, represented by the documents in the reference corpus, as much as possible. Our experiments with the METER corpus show that a preliminary search space reduction stage, based on the Kullback-Leibler symmetric distance, reduces the search process time dramatically. Additionally, it improves the Precision and Recall obtained by a search strategy based on the exhaustive comparison of word n-grams. © Springer-Verlag Berlin Heidelberg 2009
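The search space reduction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the code assumes one common symmetric form of the Kullback-Leibler distance, the sum over terms of (P(x) - Q(x)) log(P(x) / Q(x)). The whitespace tokenisation, the additive smoothing constant `epsilon`, the `keep` parameter and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def word_distribution(text, vocabulary, epsilon=1e-6):
    """Smoothed unigram distribution of `text` over a shared vocabulary.

    `epsilon` is an assumed smoothing constant that keeps every
    probability strictly positive so the logarithm is defined.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary)
    raw = {w: (counts[w] / total if total else 0.0) + epsilon for w in vocabulary}
    norm = sum(raw.values())
    return {w: p / norm for w, p in raw.items()}


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler distance (assumed form, see lead-in)."""
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def reduce_search_space(suspicious, references, keep=2):
    """Return the names of the `keep` reference documents closest to `suspicious`."""
    vocabulary = set(suspicious.lower().split())
    for doc in references.values():
        vocabulary |= set(doc.lower().split())
    p = word_distribution(suspicious, vocabulary)
    scored = sorted(
        (symmetric_kl(p, word_distribution(doc, vocabulary)), name)
        for name, doc in references.items()
    )
    return [name for _, name in scored[:keep]]


# Toy data, invented for this example only.
references = {
    "a": "the cat sat on the mat",
    "b": "stock markets fell sharply today",
    "c": "the cat lay on the mat",
}
candidates = reduce_search_space("the cat sat on a mat", references, keep=2)
```

Only the documents returned in `candidates` would then be passed to the costlier exhaustive comparison of word n-grams.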
Using sentence embedding for cross-language plagiarism detection
The growth of textual content in various languages and the advancement of automatic translation systems have led to an increase in cases of translated plagiarism. When a text is translated into another language, word order will change and words may be substituted by synonyms; as a result, detection becomes more challenging. The purpose of this paper is to introduce a new technique for English-Arabic cross-language plagiarism detection. This method combines word embedding, term weighting techniques, and universal sentence encoder models in order to improve detection of sentence similarity. The proposed model has been evaluated on English-Arabic cross-lingual datasets, and experimental results show improved performance when compared with other Arabic-English cross-lingual evaluation methods presented at SemEval-2017.
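One ingredient named in this abstract, combining word embeddings with term weights to compare sentences, might be sketched as follows. This is a simplified illustration and not the paper's model: it omits the universal sentence encoder entirely, it assumes both languages already share one embedding space, and the two-dimensional vectors, the weights and all function names are hypothetical toy values.

```python
import math


def sentence_vector(tokens, embeddings, weights):
    """Weighted average of the word vectors of `tokens`.

    Tokens missing from `embeddings` are skipped; tokens missing from
    `weights` get an assumed default weight of 1.0.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in embeddings:
            w = weights.get(token, 1.0)
            for i, value in enumerate(embeddings[token]):
                total[i] += w * value
            weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy vectors in an assumed shared English-Arabic space.
toy_embeddings = {
    "book": [1.0, 0.0],
    "كتاب": [0.9, 0.1],
    "car": [0.0, 1.0],
    "the": [0.5, 0.5],
}
# Hypothetical term weights that down-weight a frequent function word.
toy_weights = {"the": 0.1}

arabic = sentence_vector(["كتاب"], toy_embeddings, toy_weights)
match = cosine(sentence_vector(["the", "book"], toy_embeddings, toy_weights), arabic)
mismatch = cosine(sentence_vector(["the", "car"], toy_embeddings, toy_weights), arabic)
```

With these toy values, `match` is higher than `mismatch`, which is the behaviour a similarity-based detector relies on when flagging a translated sentence pair.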